The application of machine learning methods to the prediction of novel ligands for RORγ/RORγT receptors

In this work, we developed and applied a computational procedure for creating and validating predictive models capable of estimating the biological activity of ligands. The combination of modern machine learning methods, experimental data, and the appropriate setup of molecular descriptors led to a set of well-performing models. We thoroughly inspected both the methodological space and various possibilities for creating a chemical feature space. The resulting models were applied to the virtual screening of the ZINC20 database to identify new, biologically active ligands of RORγ receptors, which are a subfamily of nuclear receptors. Based on the known ligands of RORγ, we selected candidates and calculate their predicted activities with the best-performing models. We chose two candidates that were experimentally verified. One of these candidates was confirmed to induce the biological activity of the RORγ receptors, which we consider proof of the efficacy of the proposed methodology.


Introduction
Nuclear receptors are a superfamily of transcription factors whose activity is regulated by the binding of a specific ligand [1].In the human genome, 48 genes encode nuclear receptors that are implicated in various physiological processes, including development, differentiation, reproduction and homeostasis [1,2].One of these nuclear receptor genes, RORC, encodes two isoform proteins named ROR and RORT under alternative promoters [3,4].The two isoforms exhibit divergence in their N-terminal regions, spanning 21 amino acids, and they also display distinct tissue distributions and functions.The longer ROR variant is broadly expressed and is implicated in the regulation of the circadian cycle and metabolism of lipids and glucose [5][6][7].Conversely, the shorter RORT isoform is confined to Th17 lymphocytes, where it directly regulates the differentiation of these lymphocytes from naïve CD4+ cells Crome et al. [8].These Th17 cells, known for their production of interleukin 17 (IL-17A/F), form a critical part of the adaptive immune system, contributing to the defense against pathogens (e.g., Staphylococcus aureus, Bacillus anthracis, and Candida albicans) primarily at mucosal barriers [9][10][11].However, dysregulation leading to the overactivation of Th17 lymphocytes has been linked to tissue damage in certain autoimmune diseases, e.g., multiple sclerosis [12], psoriasis [13], and rheumatoid arthritis [14].It has been determined that several * Corresponding author.
small molecules that bind the ligand-binding domain (LBD) of RORT protect against autoimmune diseases in mouse models, similar to RORC knockdown in mice [15].Thus, the molecules acting as inverse agonists of RORT, inhibiting the activity of this transcription factor, and consequently, diminishing the pathological function of Th17 cells in autoimmune disorders [16][17][18] are under intense investigation [19,20].Upon binding of an inverse agonist, a conformational change occurs in the RORT LBD that displaces the coactivators and recruits corepressors [7,18,21].As a result, RORT can no longer bind the ROR response element (RORE) [16] within the regulatory regions of its target genes.However, some inverse agonists may also bind previously unknown allosteric binding pockets, causing a conformational change in the RORT LBD, that prevent the binding of coactivators [22].Within this study, we propose a set of carefully developed QSAR models that can predict the biological activity of small molecules in the context of the ROR and RORT receptors.We employed six commonly applied machine learning methods, i.e.: extreme gradient boosting [23] as implemented in the XGBoost library [24], multilayer perceptron [25,26] as implemented in the TensorFlow library [27], support vector machines [28][29][30], random forest [31,32], ridge [33,34] and bagging as implemented in the scikit-learn library [35].These methods differ significantly in the level of sophistication and thus in overall performance.Within the current study, we created a generic Python application that can be applied for https://doi.org/any QSPR/QSAR problem.This application will become freely available in the near future, and an in-depth description will be provided in a follow-up article.The application can easily be applied to crosscheck the current study and to pursue further investigation.Due to the level of abstraction, the code can easily be extended to alternative approaches, at the levels of both molecular feature creation and predictive analytics techniques.The presented approach, which is based on cheminformatics and machine learning methods, allows us to perform efficient in silico analysis of the properties of a given molecule with respect to the ligand-binding domain of the ROR and RORT receptors and to determine the probability of their interaction followed by an expected biological effect.This approach provides a tool for much faster prediction of potential ligands and increases the efficiency of drug discovery compared to tedious and time-consuming screening performed in a laboratory.This methodology enables the analysis of potential interactions of various molecules with any proteins acting as receptors.Interestingly, the potential specificity of a given compound toward multiple proteins can also be examined, which may be crucial for the investigation of drug-drug interactions or the prediction of potential side effects of a new drug candidate.The efficiency of the method has been experimentally proven.Developed machine learning models have been applied for the virtual screening of the ZINC20 database, and two unknown ROR and RORT ligands have been identified thus far.

In-silico techniques 2.1.1. Experimental data
The data were extracted from the ChEMBL database, version 3.0 [36].The ChEMBL database is a manually curated, commonly available repository of bioactive molecules with drug-like properties.Currently, the database contains information on more than 2 million molecules and almost 20 million activity measurements in the context of almost 15 thousand targets.The ChEMBL database is a valuable source of experimental data that provides the fundamental bases for many QSAR investigations.Within the current study, we extracted 7307 activities related to the Homo sapiens ROR receptor from this database.These entries reflect various measured quantities, such as IC50, EC50, efficacy, and inhibition (Fig. 1).After extracting the entries related to the IC50 values and removing the rows that did not contain credible "standard value" and "standard unit" fields, 3246 measurements of activity remained.It is quite common to have more than one activity entry for certain molecules; thus, after aggregating the entries per molecule, the effective number of species was 2472.For the majority of the collected molecules, there is only a single IC50 entry, but in certain cases, the number of measurements reaches 15.Sometimes the information is confusing, i.e., the IC50 values for a given molecule span a broad range.To cope with this problem, we excluded entries for which the standard deviation of available measurements exceeded the threshold value of 100 nM.For the remaining cases with multiple IC50 entries, we implemented two strategies: either we chose the arithmetic mean or the median value.We have also tried not remove the observations; in this case, all activity measurements were taken into account, and the mean values were used as a final aggregate.For the regression models we applied logarithmic scaling, and the target quantity for the regression problems was chosen as the negative of natural logarithm of IC50.For the classification problems, on the other hand, we split the species into "active" and "inactive" groups with the threshold value being arbitrarily chosen at the level of 1000 nM.

Molecular feature creation
Within this study, we have proposed molecular representation in the form of molecular fingerprints and molecular descriptors.The initial form of the molecule is expected to be provided as SMILES code [37].Within the implemented transformation pipeline, the SMILES codes can be converted to a fingerprint representation, and currently, all the methods of fingerprint calculation available in the RDKit library [38] can be applied.In particular, the RDKit-specific fingerprints inspired by the public descriptions of the daylight fingerprint [39], the atom pair [40], the topological torsion [41], the Morgan fingerprints, [42] and MACCS keys [43] can be applied to create the molecular representation of the considered species.On the other hand, the molecular representation can be further augmented by extending the space with selected molecular descriptors.To achieve this step, we incorporated the capabilities available in the Mordred software [44], which is an open-source library providing an implementation of 1825 various molecular descriptors.Within this study, we selected possible ways of molecular representation creation.There were attempts to apply molecular fingerprints with certain internal parametrization, i.e., the length of bit representation and path size.On the other hand, the classical QED descriptors have been applied to the molecular descriptor space [45].There were also attempts to combine both molecular descriptors and fingerprint representations.These efforts aimed to determine the best performing molecular representation that could be considered a good practice for these and future efforts.It is worth mentioning that molecular features representation limited to molecular fingerprints only, especially when considering short vector lengths, can be insufficient for QSAR/QSPR studies.

Machine learning training
Each predictive methodology involves a certain number of hyperparameters.The particular choice of hyperparameters has a great impact on the overall predictive model capabilities; therefore, it is crucial to determine them systematically.Here, we have applied the dedicated framework Hyperopt [46], which implements the heuristics of the tree of Parzen estimators [47].This efficient tool supports searching the space of hyperparameters.Within the current study, we have defined the hyperparameter space as follows, depending on the methodology: • XGBoost: n_estimators (number of gradient boosting trees), max_depth (maximum tree depth), gamma (minimum loss reduction required to make further split on a leaf), rel_alpha (L1 regularization term), reg_lambda (L2 regularization term), min_child_weight (minimum sum of instance weight needed in a child), colsample_bytree (subsample ratio of columns when con-structing each tree), eta (learning rate), subsample (subsample ratio of the training instance) Each methodology has been carefully investigated in the context of the regression and classification predictive capabilities.By applying the well-established method of hyperparameter optimization mentioned above, the impact of their random (or default) selection has been largely mitigated.Thus, the final models exploited the majority of their predictive capabilities.

The ZINC20 chemical space search and applicability domain
The final, best performing models were applied to search the ZINC20 database [48].We selected all known ligands of ROR receptors as good similarity-search candidates.The search space was spanned by approximately 880 million compounds available in the ZINC20 database.For all these species, we calculated the 1024-bit Morgan fingerprints with the chemfp library [49].In the next step, we calculated the Tanimoto similarity coefficients between the known ROR ligands and ZINC20 members.We have assumed here that a potential candidate must be at least 70% similar to a known ROR ligand.This approach resulted in fewer than 2000 compounds for which the predictive models were applied.To estimate the applicability domain of our models, we designed the chemical space as two T-distributed stochastic neighbors embedding (t-SNE) [50] components calculated on the basis of the 1024-bit molecular fingerprint.Within this space, all the training points and the ZINC20 candidates are shown.

Molecular docking calculation
To qualitatively and quantitatively assess the interaction between the two hit molecules and the ROR receptor, we performed molecular docking calculations.We applied Smina software [51], and the structure of the receptor was downloaded from the PDB database under the identifier 4WPF [52].The geometry of the receptor was prepared in Chimera 1.16 software [53] and reflects the agonistic conformation of the ROR receptor.

Cell viability
Cell viability after treatment with the analyzed compounds was determined using the CellTiter-Glo ® Luminescent Cell Viability Assay from Promega (Fitchburg, WI, USA) according to the manufacturer's instructions.Luminescence was measured using Infinite ® 200 PRO (Tecan, Männedorf, Switzerland).

Transient transfection and luciferase measurements
The reporter HepG2-ROR cells were seeded on 96-well white plates at a density of 15,000 cells per well.Twenty-four hours later, the cells were transfected with control CMV-XL5 or an expression vector encoding the ROR transcription factor, ROR (Origene, Rockville, MD, USA) or ROR [56] (a kind gift from Dr. R. Schuele, Freiburg, Germany) transcription factors and pCMV6-SEAP (a gift from Dr. S. Schlatter, Zurich) using TurboFect Transfection Reagent (Thermo Fisher Scientific, Waltham, MA, USA).The next day, the cells were treated with the analyzed compound, and after 24 h, they were lysed and subjected to luciferase measurements, which were determined using an Infinite ® 200 PRO (Tecan Group, Männedorf, Switzerland).As a substrate for luciferase, D-luciferin from Cayman Chemical Company (Ann Arbor, MI, USA), was selected.The secreted alkaline phosphatase activity was spectrophotometrically determined at 405 nm and was utilized as a control of transfection efficiency.

CD4+ cell isolation and Th17 polarization
Naïve CD4+ lymphocytes were isolated using CD4 M-pluriBeads ® anti-hu (pluriSelect Life Science, Leipzig, Germany) from buffy coats of healthy, anonymous donors.Buffy coats were purchased from the Regional Center for Blood Donation and Blood Treatment, Łódź, Poland, as waste material.To obtain fully differentiated Th17 lymphocytes from naïve CD4+ cells, we used a protocol that was previously described by Wilson et al. [57].Briefly, freshly isolated cells were cultured in RPMI 1640 medium (PAN-Biotech, Aidenbach, Germany) containing 1% human AB serum, 50 ng/mL human IL-1b, 30 ng/mL human IL-6, 10 ng/mL human IL-23, 10 ng/mL human TGF- and beads coated with anti-CD2, anti-CD3, and anti-CD28 antibodies (T-Cell Activation/Expansion Kit from Miltenyi Biotec, Bergisch Gladbach, Germany) for 5 days in the presence of increasing concentrations of the analyzed compounds.Cytokines were purchased from PeproTech (Rocky Hill, NJ, USA).

Statistics
The results obtained from cell lines were analyzed using one-way ANOVA followed by Tukey's post hoc-test.The results from primary human cells were analyzed with Friedman repeated-measures ANOVA on ranks followed by Fisher's LSD method for all pairwise multiple comparisons.A  < 0.05 value was considered statistically significant.Statistical analysis was performed using SigmaStat ver.4.0 (Systat Software Inc., San Jose, CA, USA).

In-silico results
For each of the methodologies, we have prepared a family of predictive models solving the classification and regression problem.The former models predict the activity of the molecule expressed as a binary value depending on the IC50 value.When the IC50 value is below the threshold value, here set to be 1000 nM, then the molecule is considered active; otherwise, it is considered inactive.For some methodologies, the returned value is only an active/not-active category; for other methodologies, the logit value is provided.The latter is the result of sigmoid transformation and might be considered as the probability that the molecule will fall into the "active" category.The probability value, in addition to the information about the activity, also delivers information about the confidence of the model.Ideally, if the model worked perfectly on a nonconfusing dataset, the predictions would be 0 and 1 for the nonactive species and active species, respectively.However, in real-world applications, when both the methodology and data are imperfect, the prediction is the value from the (0, 1) interval.Therefore, the confidence of the classifier can be modeled as the difference between a certain prediction and a value of 0.5, which reflects the model uncertainty.
Within each methodology, we considered multiple variants of the predictive models.All results are collected in Tables 1, 2, 3, 4, 5 and 6 for the random forest, support vector machine, XGBoost, ridge, multilayer perceptron and bagging classifiers, respectively.The results for the models that solve the regression problems are shown in Tables 7,  8, 9, 10, 11 and 12, in the same order of methodologies.The meaning of a particular variant of the method is encoded in the column "Predictive model variant".The encoding appearing in this column contains 5 comma-separated components with the following meaning: • Methodology acronyms: "RF", "SVM", "XGBoost", "Ridge", "MLP" and "Bagging" for random forest, support vector machines, XG-Boost, ridge, multilayer perceptron and bagging • Molecular features components: "MD", "FP", and "FPMD".In the case of an "MD" entry (molecular descriptors), the feature space is spanned by: -Eight molecular descriptors: molecular weight, octanol/water partition coefficient, number of hydrogen bond donors, number of hydrogen bond acceptors, molecular polar surface area, number of rotatable bonds, number of aromatic rings and molar refractivity, -Three drug-likeness indices: QED [45], Lipinski "rule-of-five" [61,62], and Ghose filter.[63] If "FP" is present (Fingerprint), then the circular Morgan fingerprint [42] with a length of 1024 is calculated.If both entries are present, i.e., "FPMD", then the molecular feature space was spanned both by the molecular descriptors and fingerprints.
• Principal component analysis (PCA) features.In the case of molecular fingerprints ("FP"), we have provided a PCA-transformed version with a differing number of principal components.This ap-proach results in four additional variants of the predictive models accounting for 128, 256, 512 and 1024 principal components denoted as "PCA128", "PCA256", "PCA512" and "PCA1024", respectively.The PCA components were sorted according to the variance gain; thus, by extending the PCA space, we incorporated additional components with decreasing variance gain.The base variant, i.e., the variant that avoids the PCA transformation is noted as "NoP-CA".• Raw data aggregation strategy.Raw biological data are ambiguous for some compounds.Thus, multiple IC50 entries are generated for the same molecule.However, for the final predictive model creation, a single value is expected.For some compounds, the experimental value spread was significant.In extreme cases, the same compound could be concluded to be active based on certain experimental values, i.e., the IC50 values were well below 1000 nM, and inactive based on experimental values, i.e., the IC50 values were well above 1000 nM.Therefore, we introduced the standard deviation as a measure of experimental value spread and a threshold value of 100 nM above which the compound was disregarded.The remaining multiple entries were aggregated as the mean or median.Three possibilities exist: -"median100": the threshold value of 100 nM and the median as an aggregation function, -"mean100": the threshold value of 100 nM and the arithmetic mean as an aggregation function, -"meanNoLim": no standard deviation threshold and arithmetic mean as an aggregation function.• The dataset used for calculating the quality measures.Here we have two options: -"test": the test set of 10% of the data extracted prior to the training (unseen data), -"train": the training set (provided in the supplementary material).
Each row in the tables is related to a carefully determined predictive model with optimized hyperparameters according to a certain strategy.The hyperparameter space was defined according to the nature of the methodology; the section Machine Learning training provides details.The optimization of the hyperparameter space was carried out with the hyperopt module [46], the number of evaluations of the goal function was set to 200, and the optimization goal function was chosen to maximize the precision value or to minimize the mean square error (MSE) for the classifiers and regressors, respectively.The goal function was calculated in the 5-fold cross-validation scheme to avoid the impact of a particular split on the resulting quality measures.Prior to the training, 10% of the data were extracted and later applied for a final quality estimation.For each predictive model, exactly the same splitting strategy was applied; therefore, the quality measures presented in the tables refer to the same molecules.

Classifiers
The quality of each classifier is summarized by seven quality measures: precision, recall, accuracy, F1-score, area under the receiver operator curve (ROCAUC), average precision (AP) and Matthew correlation coefficient.All detailed results are presented in Tables 1, 2, 3, 4, 5 and 6.Each methodology was considered in 18 different scenarios, as explained above.The exception here is the MLP method, for which we did not consider the "NoPCA" variants of the cases involving the molecular fingerprints due to poor performance of the MLP architectures with sparse data.The predictive capabilities differ depending on the methodology itself and the particular variant of this methodology.Table 13 summarizes the best classifiers from each methodology; these results are shown graphically in Fig. 2. Our ultimate goal was to create efficient models capable of predicting biologically active compounds.The efficiency here simultaneously means high precision and recall, since we want to avoid false-positives and false-negatives.Among all consid-

Table 1
The results obtained for the classifiers based on the random forest method.The presented results are for the randomly chosen hold-out set of the compounds.

Table 2
The results obtained for the classifiers based on the support vector machines method.These results are for the randomly chosen hold-out set of the species.

Table 3
The results obtained for the classifiers based on the XGBoost method.These results are for the randomly chosen hold-out set of the species.

Table 4
The results obtained for the classifiers based on the ridge method.These results are for the randomly chosen hold-out set of the species.

Table 5
The results obtained for the classifiers based on the multilayer perceptron architecture.These results are for the randomly chosen hold-out set of the species.

Table 6
The results obtained for the classifiers based on the bagging method.These results are for the randomly chosen hold-out set of the species.(average precision), the XGBoost model seems to perform slightly better.The precision-recall plot of this model is presented in Fig. 3.The analogous plots of remaining considered models as well as the confusion matrices and ROC curves are provided in the supplementary information.

Regressors
The quality estimation of the regression predictive models was established based on the R 2 coefficient, mean squared error (MSE), mean squared logarithmic error (MSLE), mean absolute error (MAE) and

Table 7
The results obtained for the regressors based on the random forest method.The Presented results are for the randomly chosen hold-out set of the species.

Table 8
The results obtained for the regressors based on the support vector machines method.The presented results are for the randomly chosen hold-out set of the species.

Table 9
The results obtained for the regressors based on the XGBoost method.The presented results are for the randomly chosen hold-out set of the species.

Table 12
The results obtained for the regressors based on the bagging method.The presented results are for the randomly chosen hold-out set of the species.7, 8, 9, 10, 11 and 12. Similar to classifiers, the quality of the regressors depends on the methodology and chosen strategy.The best performing models are summarized in Table 14 and Fig. 5.It can be clearly concluded that the best model was obtained with the SVM method and is characterized by the lowest MSE value (below 0.3 log units) and the highest R 2 coefficient.In Fig. 4, we present the plot of the predicted pIC50 versus the actual values.The analogous plots of the remaining regression predictive models obtained for both the training and test sets are provided in the supplementary information.

ZINC20 search
All known ROR ligands were compared with the ZINC20 database, and the most similar species were retrieved.ZINC20 virtual screen-   ing led to 1673 compounds.For each of these compounds, we applied the leading predictive models, i.e., those for which the best quality measures were achieved (the section In-silico results provides details).Among these molecules, 220 were characterized by a classifier probability above 0.9.The majority the predicted IC50 values for the ZINC20 candidates varied in the range of 35 nM-16.5 μM.There were three predictions above 50 μM, but for clarity, they were removed.Interestingly, the distribution of the classifier probabilities resembles a uniform distribution, whereas the predicted pIC50 values form a Gaussian-like distribution (Fig. 6 and Fig. 7).We focused our attention on species for which the classifier probabilities exceeded 0.9.Within this set, 80% of the regressor predictions are below 5.0 μM; however, the remaining 20% exceed this value (Fig. 8).This finding clearly shows an inconsistency between the classifier prediction and the regressor prediction.From these molecules, based on the predictions, commercial availability and price, we selected two molecules for further experimental validation (Fig. 9).Both compounds are characterized by a high probability of biological activity and a relatively low IC50 value.These molecules were also carefully inspected against the area of applicability of the QSAR models.Based on the proposed protocol (the section The ZINC20 chemical space search and applicability domain provides technical details), we created a visualization of the training set and these molecules within the chemical space (Fig. 10).For both, there are training representatives in the nearest neighborhood; therefore, we concluded that they are within the area of applicability of the developed models.Interestingly, among the molecules resulting from the ZINC20 database screening, we identified one species, TMP920, for which experimental data were available [16], even though it was not yet contained in Fig. 8.The pIC50 values of the ZINC20 subset.The distribution of predicted IC50 values of ZINC20 candidates for which the classifier probability was above 90%.The majority of the most promising candidates are focused in the area below 5000 nM.This part of the distribution resembles skewed Gaussian distribution.
Fig. 9.The chosen compounds.The chemical structures of compounds that were chosen for further experimental verification.Together with the chemical structures, predictions of biological activity are provided.
the ChEMBL database.As a result, this molecule was not present in the training data utilized in this study.However, the regressor predicted an IC50 value of 37 nM, whereas the measured value was 30 nM [16].This excellent agreement is an additional presumption supporting the results and reinforcing the methodology proposed here.The entire list of 1673 compounds considered within the virtual screening discussed here, together with the predicted biological activities, is available in the Supplementary Information.

Molecular docking
The two experimentally verified compounds were docked to the agonistic conformation of the ROR receptor (see section 2.1.5for further details related to the applied methodology).The binding energy of both of them amounts to -8.6 kcal/mol.The final poses are presented in Fig. 11.In the case of both ligands, one hydrogen bond was formed between the oxygen of the nitro group of the ligand and the hydroxyl group of glutamic acid (GLU-379).The remaining ligand-protein interactions are of nonpolar nature and are formed deeper in the hydrophobic part of a binding pocket.They are mainly related to the aromatic interaction between the phenyl ring of the ligand and the phenyl rings of surrounding amino acids, mainly PHE-387, PHE-388, and HIS-323.Based on these results, it is difficult to explain noticeable differences in the biological activity of 2C3MP and 2IP6MP compounds.Most likely, subtle differences in the protein-ligand aromatic interactions in the hydrophobic cavity lead to different strengths of the HIS-479 -TYR-502 hydrogen bond, which is a key factor determining the conformation of the receptor and thus its biological activity [52].

Discussion
The identification of new ROR/RORT ligands may have many potential applications in the future that seem to be overlooked by the world of science today.Due to the participation of the ROR isoform in the regulation of G6PC, agonists can be employed, e.g., in the treatment of nonalcoholic fatty liver disease to improve the functions of liver cells [68].In particular, the compound that we identified shows very low cytotoxicity (Fig. 12A).Th17 lymphocytes, which are implicated in cancers [69,70], may promote the tumor microenvironment in some types of cancers, e.g., breast cancer [71].However, Th17 lymphocytes can also participate in tumor eradication by producing large amounts of cytokines as well as stimulating infiltration of the microenvironment by dendritic cells, cytotoxic T lymphocytes or natural killer cells [72][73][74].Thus, 2-chloro-N-(2-chloro-3-methylphenyl)-5-nitrobenzamide, because of its low cytotoxicity, may potentially be employed in some anticancer regimens to increase not only the antitumor properties of Th17 lymphocytes but also the efficacy of the immune system.In recent years, Th17 lymphocytes have also attracted atten-  tion as an interesting component for adoptive cell therapy [75][76][77][78], in which cells are obtained from the patient, multiplied outside the body, and then reintroduced in large numbers into the bloodstream to destroy the tumor.The use of RORT ligands (including the newly identified 2-chloro-N-(2-chloro-3-methylphenyl)-5-nitrobenzamide) has the potential to promote ex vivo cell proliferation and to improve Th17 cell phenotypes.Recently, Lai et al. showed that the improvement in the generation of Th17 lymphocytes is beneficial in adoptive cell therapy in a murine model [79].Nevertheless, owing to potential interactions with the PXR/CYP3A4 signaling pathway, the prospective utilization of the identified compound or other benzamide derivatives should be approached with caution and necessitates further investigation.It is not possible to rule out the possibility that our findings may limit the potential use of such compounds in vivo.Successful experimental validation has proven that the developed predictive models are correct.The combination of open source resources, including the experimental data retrieved from the ChEMBL database as well as all applied libraries of the Python ecosystem, led to an efficient and robust computational protocol in the form of a Python program.This tool covers 6 commonly available machine learning methodologies and a set of auxiliary functionalities supporting data retrieval, cleaning and final processing.The workflows are managed by dedicated configuration files that allow for large-scale calculations.In summary, our cheminformatics approaches that used machine learning techniques allowed us to identify a new ligand of ROR/RORT receptors but with already known nitrobenzamide, [80] or a general benzamide scaffold [81], [82], [83], [84], [85].It should be noted that our approach was aimed at identifying ligands for ROR/RORT receptors, not specifically agonists or inverse agonists, which are sometimes distinguished by only one functional group [86] and can be utilized for any protein target whose activity is regulated by small molecular weight ligands.

Conclusions and future effort
Within the current study, we have introduced an efficient methodology that allows the creation of well-performing QSAR models based on commonly available machine learning methods.We have carefully examined 6 methodologies, and within each of them, we considered selected alternative approaches oriented on the experimental raw data processing and the way the chemical features were created.Among all considered predictive models, we selected the best performing classifier and regressor, which were later applied to the preselected compounds.The selection was oriented on the ZINC20 database containing ca. 880 million commercially available compounds.First, based on the molecular similarity between known ROR ligands and ZINC20 species, we selected ca.1700 candidates, which were at least 70% similar to biologically active ROR ligands in terms of the Tanimoto coefficient based on Morgan fingerprints.Second, these species were inspected with the best performing regressor and classifier, which led to a shortened list of candidates.Last, we have taken into account the commercial aspects and selected 2 species for final experimental validation, which showed that, despite the high similarity, only one of the two selected compounds demonstrated activity toward the ROR/RORT receptors.The percentage of success, in this case, is very high and indicates that the use of our methodology based on machine learning in conjunction with the experiment significantly improves the search for substances with the expected activity against various proteins.As a potential avenue for future efforts, we observe the development of generative approaches combined with multicriteria optimization techniques for molecular optimization.The two species, selected and experimentally verified within the current study, might be considered good starting structures for further research.In addition to biological activity, we also intend to take into account other objectives, such as solubility, toxicity, and organic synthesis difficulty.As a result, we will obtain the Pareto-optimal solutions, and similar to the current study, some of these solutions will be experimentally verified.

Fig. 1 .
Fig.1.The distribution of experimental data.The distribution of different experimental values (activities), expressed as IC50, EC50, efficacy, inhibition, etc. values, for the considered ROR ligands.The data were obtained from the ChEMBL 3.0 database[36].

Fig. 2 .
Fig. 2. The best classifiers.Bar plot showing the best classification predictive models.The quality of the models is expressed as the basic binary classifier quality measure, i.e. precision, recall, accuracy, f1-score, rocauc, average precision and Matthew coefficient.

Fig. 3 .
Fig. 3. Precision vs. recall plot of the best performing classifier.The test was performed with the dataset that was excluded prior to training thus, it reflects the objective measure of the predictive capabilities of the model.

Fig. 4 .Fig. 5 .
Fig. 4. The best classifiers.Bar plot showing the best regression predictive models.The quality of the models is expressed in terms of the  2 coefficient (R2), mean squared error (MSE), mean absolute error (MAE) and mean absolute percentage error (MAPE).The latter is reflected on the right vertical axis.

Fig. 6 .Fig. 7 .
Fig. 6.Distribution of classifier probabilities of the ZINC20 candidates.Histogram of the XGBoost classifier predictions of the ZINC20 candidate species.The distribution is approximately uniform with a slight bias toward the high probabilities.

Fig. 10 .
Fig. 10.The area of applicability of the models.The red points reflect the molecules contained in the training set, and the violet points are related to the ZINC20 candidates.The green circles illustrate the region of chemical space with two candidates chosen for further experimental study.

Fig. 11 .
Fig. 11.The final poses of chosen compounds.Both compounds were docked to the same ROR receptor structure (PDB ID: 4WPF), reflecting the agonistic conformation of this protein.

Fig. 12 .
Fig. 12.The biological effects of 2-chloro-N-(2-isopropyl-6-methylphenyl)-5-nitrobenzamid (2IP6MP) and 2-chloro-N-(2-chloro-3-methylphenyl)-5nitrobenzamide (2C3MP) on ROR-expressing HepG2 cells.A) Results of the viability of HepG2 cells treated with increasing concentrations of 2IP6MP and 2C3MP for 24 h; mean ± SD, n = 3. B) The effect of increasing concentrations of 2IP6MP and 2C3MP on ROR-HepG2 reporter cell line.Mean ± SD, n = 5.C). 2C3MP (20 μM) but not 2IP6MP (20 μM) potentiates the effect of human ROR in the ROR-HepG2 reporter cell line.Mean ± SD, n = 5; * denotes statistically significant difference at p<0.05 compare to the control treatment; # denotes a statistically significant difference at p<0.05 compare to the pCMV6-XL5 and 2C3MP treated cells.D) The effect of the 2C3MP (20 μM) and 2IP6MP (20 μM) on the expression of ROR-dependent gene, G6PC in human HepG2 cell line, determined by quantitative RT-PCR; mean ± SD, n = 4, * denotes statistically significant difference at p<0.05 compare to the control treatment.E) The effect of increasing concentrations of 2IP6MP and 2C3MP on the nhrtox-HepG2 reporter cell line.Mean ± SD, n = 5.F) Effect of 2C3MP (40 μM) and 2IP6MP (1 μM) on the expression of the PXR-dependent gene CYP3A4 in the human HepG2 cell line, as determined by quantitative RT-PCR; mean ± SD and n = 3; * denotes a statistically significant difference at p<0.05 compared to the control treatment.

Fig. 13 .
Fig. 13.Results of the cell viability of human Th17 cells treated with 2IP6MP and 2C3MP.Human CD4+ cells were isolated from buffy coats of healthy, anonymous donors and treated with increasing concentrations of 2IP6MP and 2C3MP upon Th17 for 5 days.After that time cell viability was determined using a CellTiter-Glo Luminescent Cell Viability Assay.The results are shown as the mean ± SD, n = 3 (three different donors); * denotes a statistically significant difference at p<0.05.

Fig. 14 .
Fig. 14.The impact of chosen compounds on gene expression.The effect of 2IP6MP (A) and 2C3MP (B) on the expression of the RORT-dependent genes: IL17A and IL17F in human Th17 cells.Primary CD4+ cells were isolated from buffy coats of healthy, anonymous donors and treated with increasing concentrations of 2IP6MP and 2C3MP upon Th17 for 5 days.The expression of cognate genes was determined by quantitative RT-PCR and the results are shown as dot plots with median values (bars) from seven independent donors (n=7); * indicates a statistically significant difference at p<0.05.
• Multilayer perceptron: MLP architecture (predefined architectures with a certain depth and number of neurons on each layer), activation function (rectified linear unit, scaled rectified linear unit, exponential linear unit, hyperbolic tangent), batch size, initial learning rate, minimum learning rate, regularization weight, optimizer,

Table 10
The results obtained for the regressors based on the ridge method.The presented results are for the randomly chosen hold-out set of the species.

Table 11
The results obtained for the regressors based on the multilayer perceptron architecture.The presented results are for the randomly chosen hold-out set of the species.

Table 13
The best-performing classifiers.

Table 14
The best-performing regressors.